Tunis Governorate
Language Model Tokenizers Introduce Unfairness Between Languages
Recent language models have shown impressive multilingual performance, even when not explicitly trained for it. Despite this, there are concerns about the quality of their outputs across different languages. In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases. These disparities persist even for tokenizers that are intentionally trained for multilingual support.
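A minimal sketch of how such a length disparity can be measured, assuming a pure byte-level tokenizer that emits one token per UTF-8 byte. This stands in for a real tokenizer and is not the setup used in the paper. The function names and the sample sentence pair are illustrative assumptions.

```python
def byte_token_count(text: str) -> int:
    """Token count under a hypothetical byte-level tokenizer (one token per UTF-8 byte)."""
    return len(text.encode("utf-8"))


def length_ratio(text: str, reference: str) -> float:
    """How many times longer `text` tokenizes than `reference`."""
    return byte_token_count(text) / byte_token_count(reference)


# Illustrative parallel pair (an assumption, not data from the paper).
english = "Hello"
russian = "Привет"  # Russian for "Hello"; each Cyrillic letter takes 2 bytes in UTF-8

print(byte_token_count(english))       # 5
print(byte_token_count(russian))       # 12
print(length_ratio(russian, english))  # 2.4
```

Even this simple scheme charges the Russian greeting more than twice as many tokens as the English one. A learned subword tokenizer would give different numbers, but the same ratio can be computed by swapping in its token count.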
Programming in Assembly Is Brutal, Beautiful, and Maybe Even a Path to Better AI
Whether your chip is running a vintage computer game or the latest DeepSeek model, it'll reward you for speaking its native language. But if you took a look beneath the pixels--the rickety rides, the crowds of hungry, thirsty, barfing people (and the janitors mopping in their wake)--deep down at the level of the code, you saw craftsmanship so obsessive that it bordered on insane. Chris Sawyer, the game's sole developer, wrote the whole thing in assembly. Because if/when the machines take over, we should at least speak their language. Certain programming languages, like Python or Go or C++, are called "high-level" because they work sort of like human language, written in commands and idioms that might fit in at a poetry slam.
Global Sumud Flotilla reports drone attack on Gaza-bound ship in Tunisia
The Gaza-bound Global Sumud Flotilla (GSF) says a drone has struck its main ship in the Tunisian port of Sidi Bou Said, causing a fire, but that all its passengers and crew were safe. A spokesman for the GSF blamed Israel for the incident, which occurred late on Monday, but the Tunisian National Guard said reports of a drone attack were "completely unfounded". The GSF, however, insisted the incident was a drone attack and said it would provide more details on Tuesday morning.
Working Document -- Formalising Software Requirements with Large Language Models
Beg, Arshad, O'Donoghue, Diarmuid, Monahan, Rosemary
This draft is a working document containing a summary of ninety-four (94) papers, with additional sections on Traceability of Software Requirements (Section 4), Formal Methods and Its Tools (Section 5), Unifying Theories of Programming (UTP) and Theory of Institutions (Section 6). Please refer to the abstracts of [7,8]. The key differences between this draft and our recently anticipated ones with similar titles, i.e. AACS 2025 [7] and SAIV 2025 [8], are as follows: [7] is a two-page submission to the ADAPT Annual Conference, Ireland. Submitted on 18th of March, 2025, it went through a light-weight blind review and was accepted for poster presentation. The conference was held on 15th of May, 2025; [8] is a nine-page paper with an additional nine pages of references and summary tables, submitted to the Symposium on AI Verification (SAIV 2025) on 24th of April, 2025. It went through a rigorous review process. The version uploaded to arXiv.org [8] is an improved version of the submission, revised to address the specific suggestions for improving the paper.
Markov-Enhanced Clustering for Long Document Summarization: Tackling the 'Lost in the Middle' Challenge with Large Language Models
Amari, Aziz, Ammar, Mohamed Achref Ben
The rapid expansion of information from diverse sources has heightened the need for effective automatic text summarization, which condenses documents into shorter, coherent texts. Summarization methods generally fall into two categories: extractive, which selects key segments from the original text, and abstractive, which generates summaries by rephrasing the content coherently. Large language models have advanced the field of abstractive summarization, but they are resource-intensive and face significant challenges in retaining key information across lengthy documents, which we call being "lost in the middle". To address these issues, we propose a hybrid summarization approach that combines extractive and abstractive techniques. Our method splits the document into smaller text chunks, clusters their vector embeddings, generates a summary for each cluster that represents a key idea in the document, and constructs the final summary by relying on a Markov chain graph when selecting the semantic order of ideas.
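The ordering step described above can be sketched as follows. This is one possible reading and not the authors' implementation: it assumes each chunk has already been assigned a cluster label (the embedding and clustering steps are omitted), estimates a first-order Markov chain from the sequence of labels in document order, and then walks the chain greedily. The function names and the tie-breaking rule are assumptions.

```python
from collections import defaultdict


def transition_probabilities(labels):
    """Estimate P(next cluster | current cluster) from consecutive chunk labels."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(labels, labels[1:]):
        if a != b:  # ignore self-transitions; only moves between ideas matter
            counts[a][b] += 1
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }


def order_clusters(labels):
    """Greedy walk: start at the first chunk's cluster, then repeatedly move to
    the most probable unvisited cluster (ties go to the cluster seen earliest)."""
    if not labels:
        return []
    probs = transition_probabilities(labels)
    first_pos = {}
    for i, lab in enumerate(labels):
        first_pos.setdefault(lab, i)
    remaining = set(labels)
    current = labels[0]
    order = [current]
    remaining.discard(current)
    while remaining:
        candidates = {
            b: p for b, p in probs.get(current, {}).items() if b in remaining
        }
        if candidates:
            current = max(candidates, key=lambda b: (candidates[b], -first_pos[b]))
        else:
            # No observed transition to an unvisited cluster: fall back to document order.
            current = min(remaining, key=lambda b: first_pos[b])
        order.append(current)
        remaining.discard(current)
    return order


# Hypothetical cluster labels for eight chunks, in document order.
chunk_labels = [0, 0, 2, 1, 2, 1, 0, 1]
print(order_clusters(chunk_labels))  # [0, 2, 1]
```

The per-cluster summaries would then be concatenated in the returned order to form the final summary.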
Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition
Hamad, Nagham, Khalilia, Mohammed, Jarrar, Mustafa
We introduce Konooz, a novel multi-dimensional corpus covering 16 Arabic dialects across 10 domains, resulting in 160 distinct corpora. The corpus comprises about 777k tokens, carefully collected and manually annotated with 21 entity types using both nested and flat annotation schemes, following the Wojood guidelines. While Konooz is useful for various NLP tasks like domain adaptation and transfer learning, this paper primarily focuses on benchmarking existing Arabic Named Entity Recognition (NER) models, especially cross-domain and cross-dialect model performance. Our benchmarking of four Arabic NER models using Konooz reveals a significant drop in performance of up to 38% when compared to the in-distribution data. Furthermore, we present an in-depth analysis of domain and dialect divergence and the impact of resource scarcity. We also measure the overlap between domains and dialects using the Maximum Mean Discrepancy (MMD) metric, and illustrate why certain NER models perform better on specific dialects and domains. Konooz is open-source and publicly available at https://sina.birzeit.edu/wojood/#download
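For readers unfamiliar with MMD, the following is a minimal sketch of the standard biased estimator of squared MMD with an RBF kernel. The kernel choice, the `gamma` value, and the toy 2-D vectors are assumptions for illustration; the abstract does not state which features or kernel the authors used.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma=0.5):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, gamma=0.5):
    """Biased estimate of MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2.0 * mean_kernel(xs, ys, gamma)
    )


# Toy feature vectors standing in for two domains or dialects (hypothetical values).
domain_a = [[0.0, 0.0], [1.0, 0.0]]
domain_b = [[5.0, 5.0], [6.0, 5.0]]

print(mmd_squared(domain_a, domain_a))  # 0.0 for identical samples
print(mmd_squared(domain_a, domain_b))  # clearly positive for well-separated samples
```

A value near zero indicates that the two samples look alike under the kernel, while larger values indicate greater divergence.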
Formalising Software Requirements using Large Language Models
Beg, Arshad, O'Donoghue, Diarmuid, Monahan, Rosemary
This paper is a brief introduction to our recently initiated project named VERIFAI: Traceability and verification of natural language requirements. The project addresses the challenges in the traceability and verification of formal specifications by providing support for the automatic generation of formal specifications and for the traceability of requirements from the initial software design stage through to the system's implementation and verification. Approaches explored in this project include Natural Language Processing, the use of ontologies to describe the software system domain, the reuse of existing software artefacts from similar systems (i.e. through similarity-based reuse), and large language models to identify and declare the specifications, as well as the use of artificial intelligence to guide the process.